Group: 1
Team Members:
1) Anirban Kar Chaudhuri (A0108517H)
2) Maradana Vijaya Krishna (A0178453W)
3) Putrevu Manoj Niyogi (A0213557E)
4) Sivasankaran Balakrishnan (A0065970X)
Produce visualisations to understand the importance of each predictor variable, as well as its
underlying data distribution.
This will enable us to determine the meaning of each predictor variable and how strongly it influences
the prediction of customer churn.
Lastly, k-means clustering is carried out to profile customers by their tenure and monthly charges.
1) Importing Data & Examining Data Types
2) Data Visualisation
3) Cluster Analysis Based on Tenure and Monthly Charges
4) Boxplots Of Monthly Charges Against Categorical Predictors
#Importing libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # visualization
import warnings
warnings.filterwarnings("ignore")
import seaborn as sns
churn_data=pd.read_csv("Telco-Customer-Churn.csv")
churn_data.info()
churn_data.TotalCharges = pd.to_numeric(churn_data.TotalCharges, errors='coerce')
churn_data.TotalCharges = churn_data.TotalCharges.ffill() # forward-fill missing values; fillna(method='ffill') is deprecated in recent pandas
After conversion to numeric, the 'TotalCharges' column contains 11 missing values; these are forward-filled.
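The coercion and forward-fill steps can be sketched on toy data. The values below are made up and only mimic the blank entries assumed to be behind the missing 'TotalCharges' values; this is an illustration, not the dataset itself.

```python
import pandas as pd

# Toy stand-in for the TotalCharges column (made-up values).
# Blank strings mimic entries that cannot be parsed as numbers.
raw = pd.Series(["29.85", " ", "108.15", " ", "1889.5"])

# errors="coerce" turns unparseable entries into NaN instead of raising
numeric = pd.to_numeric(raw, errors="coerce")
n_missing = int(numeric.isnull().sum())

# Forward-fill: each gap takes the previous valid value
filled = numeric.ffill()

print(n_missing)
print(filled.tolist())
```

Note that forward-filling borrows the value of whichever customer happens to precede the gap, so the result depends on row order.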
churn_data.isnull().sum()
churn_data.info()
churn_data.nunique() #Number of unique values for categorical variables
print(churn_data["Churn"].value_counts()/len(churn_data)*100)
#Separating categorical and numerical columns
Id_col = ['customerID']
target_col = ["Churn"]
cat_cols = churn_data.nunique()[churn_data.nunique() < 6].keys().tolist()
cat_cols = [x for x in cat_cols if x not in target_col] #categorical predictor variables
num_cols = [x for x in churn_data.columns if x not in cat_cols + target_col + Id_col] #numerical predictor variables
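The split above treats any column with fewer than 6 unique values as categorical, regardless of its dtype. A minimal sketch on a made-up frame (the column values are hypothetical) shows how the threshold behaves:

```python
import pandas as pd

# Made-up miniature of the churn table
toy = pd.DataFrame({
    "customerID": ["a", "b", "c", "d", "e", "f", "g"],
    "gender": ["F", "M", "F", "M", "F", "M", "F"],
    "tenure": [1, 5, 12, 24, 36, 48, 72],
    "Churn": ["Yes", "No", "No", "Yes", "No", "No", "No"],
})

n_unique = toy.nunique()
# Columns with fewer than 6 distinct values count as categorical
low_card = n_unique[n_unique < 6].keys().tolist()
cat = [c for c in low_card if c != "Churn"]
# Everything else, minus the target and the ID, counts as numerical
num = [c for c in toy.columns if c not in cat + ["Churn", "customerID"]]

print(cat, num)
```

One consequence of this rule is that a 0/1 integer column such as SeniorCitizen is placed among the categorical predictors.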
References for plots:
Udemy course on plotly & dash: https://github.com/Pierian-Data/Plotly-Dashboards-with-Dash
Plotly website basic tutorials: https://plotly.com/python/line-and-scatter/#line-plot-with-plotly-express
Plotly colors: https://plotly.com/python/discrete-color/
#Separating churn and non churn customers
churn = churn_data[churn_data["Churn"] == "Yes"]
not_churn = churn_data[churn_data["Churn"] == "No"]
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go#visualization
def plot_pie(column) :
    trace1 = go.Pie(values = churn[column].value_counts().values.tolist(),
                    labels = churn[column].value_counts().keys().tolist(),
                    hoverinfo = "label+percent+name",
                    domain = dict(x = [0,.48]),
                    name = "Churn Customers",
                    marker = dict(line = dict(width = 2,
                                              color = "rgb(243,243,243)")),
                    hole = .6
                   )
    trace2 = go.Pie(values = not_churn[column].value_counts().values.tolist(),
                    labels = not_churn[column].value_counts().keys().tolist(),
                    hoverinfo = "label+percent+name",
                    marker = dict(line = dict(width = 2,
                                              color = "rgb(243,243,243)")
                                 ),
                    domain = dict(x = [.52,1]),
                    hole = .6,
                    name = "Non churn customers"
                   )
    layout = go.Layout(dict(title = column + " distribution in customer attrition ",
                            plot_bgcolor = "rgb(243,243,243)",
                            paper_bgcolor = "rgb(243,243,243)",
                            annotations = [dict(text = "churn customers",
                                                font = dict(size = 13),
                                                showarrow = False,
                                                x = .15, y = .5),
                                           dict(text = "Non churn customers",
                                                font = dict(size = 13),
                                                showarrow = False,
                                                x = .88,y = .5
                                               )
                                          ]
                           )
                      )
    data = [trace1,trace2]
    fig = go.Figure(data = data,layout = layout)
    py.iplot(fig)
#for all categorical columns plot pie
for i in cat_cols :
    plot_pie(i)
Inferences from pie charts:
1) Gender is not a good indicator of churn.
2) Customers who do not have partners are more likely to churn.
3) Customers without dependents are also more likely to churn.
4) Customers on month-to-month contracts are more likely to abandon the company's services.
5) Customers who have internet service, opt for paperless billing and use automatic payment services are more
likely to churn. These groups of customers tend to be tech savvy, read widely and keep up to date with the latest market
trends and rates.
6) Customers who enjoy premium streaming services are likely to leave if they are lured by competitors offering
similar services at competitive prices and with better quality.
7) Customers also tend to leave when they lack technical support and online security, as they are then less likely to succeed with the company's products.
8) Presence of phone service, especially multiple lines, drives churn.
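The pie charts compare category shares within the churn and non-churn groups. A complementary check is the churn rate within each category, which can be sketched with a row-normalised crosstab. The frame below is made up for illustration; on the real data the same call would take churn_data columns.

```python
import pandas as pd

# Made-up example: 4 month-to-month and 4 two-year customers
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["Two year"] * 4,
    "Churn": ["Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes"],
})

# normalize="index" makes each row sum to 1, so the "Yes" column
# is the churn rate within each contract type
rates = pd.crosstab(toy["Contract"], toy["Churn"], normalize="index")
print(rates)
```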
sns.heatmap(pd.crosstab(churn_data.Dependents, churn_data.Partner, normalize='all', margins=True), annot=True, cmap='ocean')
sns.heatmap(pd.crosstab(churn_data.Dependents, churn_data.SeniorCitizen, normalize='all', margins=True), annot=True, cmap='ocean')
Senior citizens have a higher probability of having dependents.
People without partners generally do not have dependents.
sns.heatmap(pd.crosstab(churn_data.PhoneService, churn_data.MultipleLines, normalize='all', margins=True), annot=True, cmap='ocean')
Those with phone service are about equally likely to have multiple phone lines as to have a single line.
Multiple lines is not actually a strong predictor.
sns.heatmap(pd.crosstab(churn_data.InternetService,churn_data.PaymentMethod, normalize='all', margins=True), annot=True, cmap='ocean')
sns.heatmap(pd.crosstab(churn_data.InternetService,churn_data.PaperlessBilling, normalize='all', margins=True), annot=True, cmap='ocean')
People who opt for paperless billing tend to utilise internet service.
Those with internet service have a clear-cut preference for automatic transfers, especially Fiber optic subscribers.
Those without internet services tend to use mailed check mostly.
sns.heatmap(pd.crosstab(churn_data.SeniorCitizen,churn_data.PaymentMethod, normalize='all', margins=True), annot=True, cmap='ocean')
People generally prefer manual transfer, probably due to safety reasons as well as lower cost, compared to automatic transfers.
This is regardless of age.
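The heatmaps above use normalize='all', so each cell is a share of all customers (a joint proportion). Statements of the form "group A has a higher probability of B" are conditional, and may be easier to read from a column-normalised table. A sketch on made-up data:

```python
import pandas as pd

# Made-up example: 6 non-seniors (0) and 2 seniors (1)
toy = pd.DataFrame({
    "Dependents": ["Yes", "No", "No", "No", "Yes", "No", "No", "No"],
    "SeniorCitizen": [0, 0, 0, 0, 0, 0, 1, 1],
})

# Joint proportions: every cell is divided by the total number of rows
joint = pd.crosstab(toy.Dependents, toy.SeniorCitizen,
                    normalize="all", margins=True)

# Conditional proportions: each column sums to 1, i.e. P(Dependents | SeniorCitizen)
conditional = pd.crosstab(toy.Dependents, toy.SeniorCitizen,
                          normalize="columns")

print(joint)
print(conditional)
```

Because the groups differ in size, a small joint cell can still correspond to a large conditional probability, and the reverse.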
#function for histogram for customer attrition types
def plot_histogram(column) :
    trace1 = go.Histogram(x = churn[column],
                          histnorm = "percent",
                          name = "Churn Customers",
                          marker = dict(line = dict(width = .5,
                                                    color = "black"
                                                   )
                                       ),
                          opacity = .6
                         )
    trace2 = go.Histogram(x = not_churn[column],
                          histnorm = "percent",
                          name = "Non churn customers",
                          marker = dict(line = dict(width = .5,
                                                    color = "black"
                                                   )
                                       ),
                          opacity = .6
                         )
    data = [trace1,trace2]
    layout = go.Layout(dict(title =column + " distribution in customer attrition ",
                            plot_bgcolor = "rgb(243,243,243)",
                            paper_bgcolor = "rgb(243,243,243)", # RGB channels must lie in 0-255
                            xaxis = dict(gridcolor = 'rgb(255, 255, 255)',
                                         title = column,
                                         zerolinewidth=1,
                                         ticklen=5,
                                         gridwidth=2
                                        ),
                            yaxis = dict(gridcolor = 'rgb(255, 255, 255)',
                                         title = "percent",
                                         zerolinewidth=1,
                                         ticklen=5,
                                         gridwidth=2
                                        ),
                           ))
    fig = go.Figure(data=data,layout=layout)
    py.iplot(fig)
#for all numerical columns plot histogram
for i in num_cols :
    plot_histogram(i)
Inferences from histogram diagrams:
1) 39% of the churn customers have a tenure of about 5 months.
2) Churn customers have monthly charges peaked at around $75 per month.
3) Approximately 55% of churn customers have a cumulative total charge of about 900 dollars.
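The percentages quoted above come from histnorm="percent", which divides each bin count by the number of observations in that trace. A minimal NumPy sketch of the same arithmetic, with made-up tenures and arbitrarily chosen bin edges:

```python
import numpy as np

# Made-up tenures in months
tenure = np.array([1, 2, 3, 4, 5, 12, 30, 60])

# Bin edges are an arbitrary choice for this illustration
counts, edges = np.histogram(tenure, bins=[0, 10, 20, 40, 80])

# Percent normalisation: each bin as a share of all observations
percent = counts / counts.sum() * 100
print(percent)
```

Since the percentage depends on the bin width, a figure such as "39% at about 5 months" refers to the share of customers in the whole bin containing that value.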
import plotly.express as px
def plotly_scatterplot(xc, yc, colour, template, trendline=None):
    fig1 = px.scatter(churn_data, x=xc, y=yc,
                      color=colour, render_mode='svg', template=template,
                      hover_name="customerID",
                      marginal_x=None,
                      marginal_y=None, trendline=trendline)
    return fig1
plotly_scatterplot(xc='MonthlyCharges', yc='TotalCharges', colour='Churn', template='plotly_dark',trendline='ols')
plotly_scatterplot(xc='MonthlyCharges', yc='TotalCharges', colour='Contract', template='plotly')
It is understood from the two scatterplots (where, for a given monthly charge, a low TotalCharges value reflects a short tenure) that:
1) Clients with lower tenure are more likely to churn
2) Clients with higher MonthlyCharges are also more likely to churn
3) Tenure and MonthlyCharges are very significant features in determining churn outcome
K-means clustering can be used to partition the dataset based on tenure and monthly charges, the significant numeric variables.
The purpose is to group instances with similar traits together.
The K in K-means denotes the number of clusters.
The algorithm initialises the cluster centroids randomly, then repeatedly reassigns each point to its nearest centroid
and recomputes the centroids; it is bound to converge to a solution after some iterations.
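The iteration described above can be sketched in plain NumPy. This is a simplified illustration with purely random initialisation, not the scikit-learn implementation; the function name kmeans_sketch and the sample points are hypothetical.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Plain k-means: assign points to nearest centroid, then update centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # New centroid = mean of assigned points (keep the old one if a cluster is empty)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return labels, centroids

# Two made-up, well separated groups of points
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = kmeans_sketch(points, k=2)
print(labels)
```

The solution reached is a local optimum that can depend on the starting centroids, which is one reason a fixed random_state is useful for reproducibility.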
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
scaler = MinMaxScaler()
churn_data[['MonthlyCharges','tenure']] = scaler.fit_transform(churn_data[['MonthlyCharges','tenure']])
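Min-max scaling is applied before clustering, presumably so that tenure (in months) and monthly charges (in dollars) contribute on a comparable scale to the Euclidean distances that k-means relies on. The arithmetic behind the transform and its inverse can be sketched with NumPy; the numbers are made up.

```python
import numpy as np

# Made-up rows of (MonthlyCharges, tenure)
x = np.array([[20.0, 1.0], [70.0, 36.0], [120.0, 72.0]])

lo, hi = x.min(axis=0), x.max(axis=0)

# Forward transform: map each column to the range [0, 1]
scaled = (x - lo) / (hi - lo)

# Inverse transform: recover the original units
restored = scaled * (hi - lo) + lo

print(scaled)
```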
def elbow_plot(data=churn_data[['MonthlyCharges','tenure']]):
    score = []
    for cluster in range(1,11):
        kmeans = KMeans(n_clusters = cluster, init="k-means++", random_state=10)
        kmeans.fit(data)
        score.append(kmeans.inertia_)
    plt.plot(range(1,11), score)
    plt.title('The Elbow Method')
    plt.xlabel('no of clusters')
    plt.ylabel('wcss')
    plt.grid()
    return plt.show()
elbow_plot()
• Inertia is the sum of squared distances from each point to its cluster centroid, summed over all clusters. Therefore, the smaller the inertia, the denser the clusters (the closer together the points are)
• A tip for choosing the optimal number of clusters is to look at the rate of decrease in inertia as each cluster is added
• The optimal number of clusters is 4, since inertia does not decrease noticeably when further clusters are added
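The inertia definition can be made concrete with a small hand computation. The helper name inertia and the points below are hypothetical; the centroids are supplied by hand rather than fitted.

```python
import numpy as np

def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return float((dists.min(axis=1) ** 2).sum())

# Four made-up points forming two tight pairs
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# One cluster: the centroid is the overall mean, (5, 1)
one = inertia(points, points.mean(axis=0, keepdims=True))

# Two clusters: one centroid per pair
two = inertia(points, np.array([[0.0, 1.0], [10.0, 1.0]]))

print(one, two)
```

The large drop between one and two clusters is the kind of "elbow" the plot is read for; further clusters on these points could only reduce inertia by at most the remaining 4.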
#Apply kmeans clustering to the entire dataset
kmeans = KMeans(n_clusters = 4, random_state = 1000).fit(churn_data[['MonthlyCharges','tenure']])
churn_data['cluster'] = kmeans.labels_
churn_data[['MonthlyCharges','tenure']] = scaler.inverse_transform(churn_data[['MonthlyCharges','tenure']])
#Plot a plotly interactive scatter plot
fig = px.scatter(churn_data, x='MonthlyCharges', y='tenure',
color='cluster', render_mode='svg', template='plotly',
hover_data=['SeniorCitizen','Dependents','Contract','InternetService',
'PaperlessBilling','PaymentMethod'],
hover_name="customerID",
marginal_x="violin",
marginal_y="violin")
fig.update_layout(title='Clusters of customers by monthly charges and tenure',
paper_bgcolor='LightBlue')
fig.show()
Overall, clusters are well segregated as seen above.
Cluster 0: High tenure, high monthly charge
Cluster 1: Low tenure, low monthly charge
Cluster 2: Low tenure, high monthly charge
Cluster 3: High tenure, low monthly charge
The pivot table below shows the mean monthly charges and tenure for each cluster, split by senior-citizen status.
The figures in the table can be verified by hovering the cursor over the interactive graph above.
cluster_charges = pd.pivot_table(churn_data, index=['cluster'], columns=['SeniorCitizen'],
values=['MonthlyCharges','tenure'], margins=True, aggfunc='mean')
sns.heatmap(cluster_charges, annot=True, cmap='ocean')
churn_data['cluster'].value_counts() #count distribution of clusters
sns.countplot(x='cluster', hue='Churn', data=churn_data) # recent seaborn requires x as a keyword; orient is inferred from x
sns.countplot(x='Churn', hue='MultipleLines', data=churn_data)
Clusters in descending order of churn probability: 2, 1, 0, 3.
Cluster 3 is defined by high tenure and low monthly charges, ideal for retaining customers.
Customers in cluster 2 (low tenure and high monthly charges) have the highest probability of churning.
Cluster 0 customers have high tenure but also high monthly charges. This shows that monthly charge is also an important predictor.
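A ranking of clusters by churn probability can be computed directly rather than read off the count plot. The sketch below uses invented labels purely to show the mechanics and does not reproduce the dataset's figures; on the real data, churn_data would take the place of toy.

```python
import pandas as pd

# Invented cluster labels and churn flags, 4 customers per cluster
toy = pd.DataFrame({
    "cluster": [0] * 4 + [1] * 4 + [2] * 4 + [3] * 4,
    "Churn": ["Yes", "No", "No", "No",
              "Yes", "Yes", "No", "No",
              "Yes", "Yes", "Yes", "No",
              "No", "No", "No", "No"],
})

# Mean of a boolean column is the share of True values, i.e. the churn rate
churn_rate = (
    (toy["Churn"] == "Yes")
    .groupby(toy["cluster"])
    .mean()
    .sort_values(ascending=False)
)
print(churn_rate)
```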
sns.countplot(x='cluster', hue='SeniorCitizen', data=churn_data) # recent seaborn requires x as a keyword
Senior citizens fall under clusters 0 and 2, clusters with high monthly charge. This means high monthly charge is problematic for senior citizens.
sns.countplot(x='cluster', hue='Contract', data=churn_data) # recent seaborn requires x as a keyword
Customers on two-year contracts are found in clusters with high tenure. Customers on month-to-month contracts are found in clusters with low tenure and contribute significantly to churn.
Let's investigate monthly charge
def plot_boxplot(column):
    fig = px.box(churn_data, x=column, y="MonthlyCharges", color="Churn",points="outliers",
                 hover_name="customerID",template='plotly')
    fig.update_traces(quartilemethod="inclusive")
    # boxes are coloured by Churn, so the title says so
    fig.update_layout(title='Monthly Charges against {} Segregated By Churn'.format(column))
    return fig.show()
for cols in ['MultipleLines','OnlineSecurity','StreamingTV','PaperlessBilling','PaymentMethod','SeniorCitizen']:
    plot_boxplot(cols)
Boxplot Inferences:
1) Senior citizens tend to have higher monthly costs, even those in the churn group. They are likely to churn but bring in substantial revenue.
2) Manual payment by check is less costly. Mailed check payment has the largest range.
3) Presence of internet service increases monthly cost significantly. Adding streaming services raises costs further.
4) Having a phone line increases cost. Multiple phone lines raise monthly cost further.
5) Large increases in cost due to internet and phone related services increase the probability of churn.
churn_data.to_pickle("customer_churn_data.pkl")